Recognition of HTML Table Structure

Authors

  • Hidetaka MASUDA
  • Shuichi TSUKAMOTO
  • Hiroshi NAKAGAWA
Abstract

Tables in HTML Web pages have become valuable knowledge sources, so it is both reasonable and necessary to develop an algorithm that extracts knowledge from them. Doing so requires a system that identifies the boundary between the attributes and the values of an HTML table and transforms the table into more understandable attribute-value pairs. In this paper, we propose an algorithm for this purpose. In outline: if a row (or column) has low similarity to the other rows (or columns), it is probably an attribute-name row (or column); otherwise it is a value row (or column). The algorithm based on this idea achieves 82% recognition accuracy lengthways and 78% recognition accuracy sideways on 300 HTML tables from Web pages downloaded from the Web.
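The outline in the abstract can be illustrated with a minimal sketch. The abstract does not say how similarity is measured, so the coarse cell-type comparison below, the 0.5 threshold, and the names `cell_type`, `row_similarity`, `attribute_rows`, and `transpose` are all assumptions made for illustration, not the authors' method.

```python
from statistics import mean


def cell_type(cell: str) -> str:
    """Coarse cell type. An assumed stand-in for the paper's similarity features."""
    text = cell.strip()
    if not text:
        return "empty"
    try:
        float(text.replace(",", ""))
        return "number"
    except ValueError:
        return "text"


def row_similarity(a, b):
    """Fraction of aligned cells whose coarse types match (assumed measure)."""
    matches = sum(cell_type(x) == cell_type(y) for x, y in zip(a, b))
    return matches / (max(len(a), len(b)) or 1)


def attribute_rows(table, threshold=0.5):
    """Indices of rows whose mean similarity to all other rows is below threshold."""
    flagged = []
    for i, row in enumerate(table):
        scores = [row_similarity(row, other)
                  for j, other in enumerate(table) if j != i]
        if scores and mean(scores) < threshold:
            flagged.append(i)
    return flagged


def transpose(table):
    """Swap rows and columns so the same test can be applied to columns."""
    return [list(column) for column in zip(*table)]


table = [
    ["Name", "Age", "Height"],
    ["Alice", "30", "165"],
    ["Bob", "25", "180"],
    ["Carol", "41", "172"],
]
header_rows = attribute_rows(table)
header_columns = attribute_rows(transpose(table))
```

Here the first row is all text while the remaining rows mix text and numbers, so its mean similarity to the others falls below the threshold and it is flagged as a likely attribute-name row. Applying the same function to the transposed table gives the column-wise check.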

Similar resources

Mining Tables from Large Scale HTML Texts

Table is a very common presentation scheme, but few papers touch on table extraction in text data mining. This paper focuses on mining tables from large-scale HTML texts. Table filtering, recognition, interpretation, and presentation are discussed. Heuristic rules and cell similarities are employed to identify tables. The F-measure of table recognition is 86.50%. We also propose an algorithm to...

Layout and Language: Challenges for Table Understanding on the Web

In this paper, we consider the table understanding task and present a catalogue of particular issues that arise when the tables are those found on the web. In addition, we consider what happens when processes commonly associated with web pages are applied to those bearing tables. 1 Table Understanding and the Web The ubiquity of tables, and their ability to describe relational information in a ...

Notes on Contemporary Table Recognition

The shift of interest to web tables in HTML and PDF files, coupled with the incorporation of table analysis and conversion routines in commercial desktop document processing software, are likely to turn table recognition into more of a systems than an algorithmic issue. We illustrate the transition by some actual examples of web table conversion. We then suggest that the appropriate target form...

Automating the extraction of data from HTML tables with unknown structure

Data on the Web in HTML tables is mostly structured, but we usually do not know the structure in advance. Thus, we cannot directly query for data of interest. We propose a solution to this problem based on document-independent extraction ontologies. Our solution entails elements of table understanding, data integration, and wrapper creation. Table understanding allows us to find tables of inter...

Automatically Extracting Ontologically Specified Data from HTML Tables of Unknown Structure

Data on the Web in HTML tables is mostly structured, but we usually do not know the structure in advance. Thus, we cannot directly query for data of interest. We propose a solution to this problem based on document-independent extraction ontologies. The solution entails elements of table understanding, data integration, and wrapper creation. Table understanding allows us to recognize attributes...

Publication date: 2004